I look forward to learning a lot about R and data science.
My course diary can be found at https://coolnesss.github.io/IODS-project/
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Fri Dec 3 15:24:19 2021"
The text continues here.
date()
## [1] "Fri Dec 3 15:24:19 2021"
Let's start by reading in the data.
analysis = read.csv('data/data.csv', header = TRUE)
str(analysis)
## 'data.frame': 166 obs. of 7 variables:
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ Attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
The data contains student questionnaire answers along with background attributes such as gender and age. The remaining variables measure learning approaches, namely strategic (stra), deep and surface (surf) learning, together with attitude and exam points.
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
ggpairs(analysis, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))
The pairwise plot indicates that, for example, gender has little effect on Points, and the corresponding correlation is also small. On the other hand, there is a large correlation between Points and Attitude, which suggests that people who were motivated to study got more points in the exam. The age distribution is highly skewed, with most respondents close to the age of 20. Otherwise, most of the other columns roughly follow a Gaussian distribution.
Let's choose the following three attributes for our regression model: Attitude, Age and gender. In other words, we are interested in whether these three variables can explain the exam points.
linear_regression = lm(Points ~ Age+Attitude+gender, data = analysis)
summary(linear_regression)
##
## Call:
## lm(formula = Points ~ Age + Attitude + gender, data = analysis)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.4590 -3.3221 0.2186 4.0247 10.4632
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.42910 2.29043 5.863 2.48e-08 ***
## Age -0.07586 0.05367 -1.414 0.159
## Attitude 0.36066 0.05932 6.080 8.34e-09 ***
## genderM -0.33054 0.91934 -0.360 0.720
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.315 on 162 degrees of freedom
## Multiple R-squared: 0.2018, Adjusted R-squared: 0.187
## F-statistic: 13.65 on 3 and 162 DF, p-value: 5.536e-08
It would seem that Attitude indeed has a large coefficient, meaning it contributes strongly to the prediction, and its very small p-value indicates that this result is unlikely to be due to chance. Gender, on the other hand, has a negative coefficient, meaning that under the model being male slightly reduces the predicted points, but its high p-value means the effect is not statistically distinguishable from zero. Finally, Age also has a fairly large p-value while having a negligible effect.
Seeing that age and gender are not reliable predictors of exam scores, let's remove them from the model and refit with attitude as the only predictor.
linear_regression = lm(Points ~ Attitude, data = analysis)
summary(linear_regression)
##
## Call:
## lm(formula = Points ~ Attitude, data = analysis)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## Attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
Now we can see that attitude is indeed a strong predictor of exam points, and the result is statistically significant, i.e. unlikely to be attributable to chance.
Finally, let's interpret the multiple R-squared statistic of the model. An R-squared of 0.19 means that the model explains about 19% of the variance in exam points. Importantly, after dropping Age and Gender, the R-squared decreased by only about 0.01, confirming that those two variables contributed very little to the fit.
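The R-squared reported by summary() can be reproduced from its definition, 1 - RSS/TSS. A quick sketch on synthetic data (the variables x, y and fit here are made up for illustration, not the course data):

```r
# R-squared equals 1 - RSS/TSS; verify on a small synthetic example
set.seed(42)
x <- rnorm(50)
y <- 10 + 0.4 * x + rnorm(50, sd = 5)
fit <- lm(y ~ x)

rss <- sum(residuals(fit)^2)   # residual sum of squares
tss <- sum((y - mean(y))^2)    # total sum of squares
r2  <- 1 - rss / tss
all.equal(r2, summary(fit)$r.squared)  # TRUE
```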
We also need to check the model assumptions using diagnostic plots.
par(mfrow=c(2,2))
plot(linear_regression, which = c(1,2,5))
The first residuals vs fitted plot shows that, apart from the extreme values near a fitted value of 28, the residuals are evenly spread around zero, so there does not seem to be an underlying non-linearity violating the linearity assumption of our model. The normal Q-Q plot shows that our sample is slightly skewed, since normally distributed residuals would fall on a straight line; this suggests the residuals do not come from an exactly normal distribution. Finally, the residuals vs leverage plot shows whether any individual observations have an outsized influence on the fitted model, i.e. whether removing a single point would change the model substantially. In this case there are no such points, since no observation comes close to the Cook's distance contours (the dotted red lines are not even visible).
date()
## [1] "Fri Dec 3 15:24:23 2021"
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble 3.1.6 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.1.1 ✓ forcats 0.5.1
## ✓ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
data = read.csv('data/alc_data.csv')
str(data)
## 'data.frame': 370 obs. of 35 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
## $ age : int 15 15 15 15 15 15 15 15 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 1 1 1 1 1 1 1 1 1 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 1 1 1 1 1 2 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 2 2 2 2 2 2 2 2 2 1 ...
## $ Medu : int 1 1 2 2 3 3 3 2 3 3 ...
## $ Fedu : int 1 1 2 4 3 4 4 2 1 3 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 3 1 4 4 4 4 2 3 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 3 3 3 2 4 2 5 4 3 2 ...
## $ reason : Factor w/ 4 levels "course","home",..: 2 4 4 1 4 1 1 4 4 4 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 2 2 2 3 2 1 2 1 1 ...
## $ traveltime: int 2 1 1 1 2 1 2 2 2 1 ...
## $ studytime : int 4 2 1 3 3 3 3 2 4 4 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 1 2 1 2 ...
## $ famsup : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 1 ...
## $ activities: Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 1 1 1 1 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 1 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 2 2 2 1 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 2 1 1 1 ...
## $ famrel : int 3 3 4 4 4 4 4 4 4 4 ...
## $ freetime : int 1 3 3 3 2 3 2 1 4 3 ...
## $ goout : int 2 4 1 2 1 2 2 3 2 3 ...
## $ Dalc : int 1 2 1 1 2 1 2 1 2 1 ...
## $ Walc : int 1 4 1 1 3 1 2 3 3 1 ...
## $ health : int 1 5 2 5 3 5 5 4 3 4 ...
## $ alc_use : num 1 3 1 1 2.5 1 2 2 2.5 1 ...
## $ high_use : logi FALSE TRUE FALSE FALSE TRUE FALSE ...
## $ failures : int 0 1 0 0 1 0 1 0 0 0 ...
## $ paid : Factor w/ 2 levels "no","yes": 2 1 1 1 2 1 1 1 1 1 ...
## $ absences : int 3 2 8 2 5 2 0 1 9 10 ...
## $ G1 : int 10 10 14 10 12 12 11 10 16 10 ...
## $ G2 : int 12 8 13 10 12 12 6 10 16 10 ...
## $ G3 : int 12 8 12 9 12 12 6 10 16 10 ...
print(max(data$age))
## [1] 22
print( min(data$age))
## [1] 15
The data comes from a questionnaire administered to students aged 15-22. It asks various questions about their background, what they do in their free time, how much alcohol they consume and so forth. Their grades are also included.
We will choose 4 variables and try to predict alcohol use from them. I will choose “failures”, since failing a class might indicate alcohol use. In addition, “freetime” will be chosen, since people who have a lot of free time might be drinking more. Age will also be included. Finally, I will choose “studytime”, since people who study a lot don't have time to drink.
library(GGally)
library(ggplot2)
data$freetime = as.ordered(data$freetime)
data$studytime = as.ordered(data$studytime)
data$failures = as.ordered(data$failures)
data_subset = data[c('alc_use', 'freetime', 'studytime', 'age', 'failures')]
ggpairs(data_subset, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))
Here we have the relationships between the chosen variables and alcohol use. From the top row we can see that in particular the number of failures correlates with alcohol use: people with 3 failures had the highest alcohol use. Age also seems to correlate slightly with alcohol use, with older students drinking more. People who study the most have the lowest alcohol use. For freetime, the connection is less clear. Overall, the chosen variables seem reasonably good at describing alcohol use.
glm_data = data[c('high_use', 'freetime', 'studytime', 'age', 'failures')]
glm_data$failures = ordered(glm_data$failures)
glm_data$age = as.numeric(glm_data$age)
glm_data$studytime = as.ordered(glm_data$studytime)
glm_data$freetime = as.ordered(glm_data$freetime)
glm_model = glm(high_use ~ ., data=glm_data, family = "binomial")
glm_model
##
## Call: glm(formula = high_use ~ ., family = "binomial", data = glm_data)
##
## Coefficients:
## (Intercept) freetime.L freetime.Q freetime.C freetime^4 studytime.L
## -3.61634 1.13774 -0.37684 -0.01943 -0.16128 -0.98268
## studytime.Q studytime.C age failures.L failures.Q failures.C
## 0.37394 0.50550 0.17260 0.79327 -0.17619 0.34081
##
## Degrees of Freedom: 369 Total (i.e. Null); 358 Residual
## Null Deviance: 452
## Residual Deviance: 413.3 AIC: 437.3
Here we cast all the features except age into ordered factors, since their levels have a clear ordering (a freetime of 5 is more than a freetime of 4).
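The .L, .Q and .C suffixes in the coefficient names come from the orthogonal polynomial contrasts that R uses by default for ordered factors; the encoding can be inspected directly:

```r
# ordered factors are encoded with polynomial contrasts by default:
# linear (.L), quadratic (.Q), cubic (.C), and so on for more levels
f <- ordered(c(1, 2, 3, 4))
contrasts(f)
```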
print(exp(coef(glm_model)))
## (Intercept) freetime.L freetime.Q freetime.C freetime^4 studytime.L
## 0.02688092 3.11970354 0.68602398 0.98075575 0.85105503 0.37430663
## studytime.Q studytime.C age failures.L failures.Q failures.C
## 1.45344515 1.65781273 1.18838494 2.21061040 0.83845977 1.40609193
print(confint(glm_model))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -7.31978014 0.02252094
## freetime.L 0.15826962 2.40079162
## freetime.Q -1.45950625 0.47225107
## freetime.C -0.71317341 0.74441514
## freetime^4 -0.65580945 0.33331230
## studytime.L -1.87775566 -0.23173544
## studytime.Q -0.38618233 1.07809059
## studytime.C -0.08866794 1.15363489
## age -0.03309116 0.38217471
## failures.L -0.72327871 2.87497442
## failures.Q -1.44760231 1.45410711
## failures.C -0.67903331 1.40836278
Since we specified some of the features as having an order, R has fit each of them with a series of polynomial contrasts (L for linear, Q for quadratic and so on). Let's focus on the linear terms. The exponentials of the coefficients can be interpreted as odds ratios. Failures, for example, has an odds ratio of about 2.2, meaning the odds of high alcohol use are roughly twice as high for students who have failed classes. Free time also seems to have a large effect, with the odds of high alcohol use about 3 times higher for people with a lot of free time. Perhaps somewhat surprisingly, age doesn't play a big role.
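To make the odds-ratio interpretation concrete: exponentiating a logistic regression coefficient gives an odds ratio, which multiplies odds rather than probabilities. A sketch using the failures.L estimate from the model above (the baseline probability p0 is a hypothetical value for illustration):

```r
b  <- 0.79327                  # failures.L coefficient (log-odds scale), from above
or <- exp(b)                   # ~2.21, matching the exponentiated coefficients above

# odds ratios act on odds, not directly on probabilities:
p0    <- 0.30                  # hypothetical baseline probability of high use
odds1 <- (p0 / (1 - p0)) * or  # baseline odds times the odds ratio
p1    <- odds1 / (1 + odds1)   # ~0.49, not simply 0.30 * 2.21
```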
alc = glm_data
probabilities <- predict(glm_model, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = alc$probability > 0.5)
# see the last ten original classes, predicted probabilities, and class predictions
select(alc, failures, age, freetime, studytime, high_use, probability, prediction) %>% tail(10)
## failures age freetime studytime high_use probability prediction
## 361 1 18 4 1 TRUE 0.7135619 TRUE
## 362 0 18 3 1 TRUE 0.4044013 FALSE
## 363 0 18 4 1 TRUE 0.5192315 TRUE
## 364 0 18 2 2 TRUE 0.2633834 FALSE
## 365 0 18 3 1 FALSE 0.4044013 FALSE
## 366 0 18 3 1 TRUE 0.4044013 FALSE
## 367 0 18 2 3 FALSE 0.1046941 FALSE
## 368 1 19 4 1 TRUE 0.7475035 TRUE
## 369 2 19 4 2 TRUE 0.6505777 TRUE
## 370 1 19 3 1 FALSE 0.6504956 TRUE
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 244 15
## TRUE 93 18
We can see from the matrix that the model predicted TRUE 18 times when the value was actually true, and 15 times when it was actually false. On the other hand, it predicted FALSE correctly for 244 of the false samples but also, incorrectly, for 93 of the true samples, so it misses most of the high-use students. The training accuracy can be calculated as follows.
c_mat = table(high_use = alc$high_use, prediction = alc$prediction)
print('accuracy')
## [1] "accuracy"
print(1 - (c_mat[1, 2] + c_mat[2, 1]) / sum(c_mat))
## [1] 0.7081081
print('majority class')
## [1] "majority class"
print(mean(glm_data$high_use))
## [1] 0.3
Always guessing “False” would give an accuracy of 0.7, since only 30% of the students are high users, so we are barely better than that baseline. I guess the model is not very good after all.
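The accuracy above is also computed on the same data the model was fit on, so it is likely optimistic. A K-fold cross-validated error rate can be computed with boot::cv.glm. This is a self-contained sketch on synthetic data (the data frame d and model m are stand-ins; the same recipe applies to glm_model and glm_data above):

```r
library(boot)

set.seed(1)
# synthetic stand-in data: one numeric predictor, binary outcome
d <- data.frame(x = rnorm(200))
d$y <- rbinom(200, 1, plogis(d$x))
m <- glm(y ~ x, data = d, family = "binomial")

# loss: proportion of observations where the 0.5-thresholded
# predicted probability disagrees with the true class
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

# 10-fold cross-validation; delta[1] is the raw CV estimate of the loss
cv <- cv.glm(data = d, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]
```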
date()
## [1] "Fri Dec 3 15:24:26 2021"
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data = Boston
str(data)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
The Boston housing dataset describes different aspects of the suburbs of Boston. There are crime rates, tax rates, distances to employment centres, ages of buildings and even the proportion of black residents in the district. The main task in this chapter is to classify areas by their crime rate.
library(GGally)
library(ggplot2)
data$chas = as.factor(data$chas)
ggpairs(data, mapping = aes(colour=chas), lower = list(combo = wrap("facethist", bins = 20)))
Here chas is a binary variable, so I used it to colour the plot. The variables nox and age seem to be highly correlated, which indicates that areas with older buildings have higher nitrogen oxide levels. In addition, distance to the employment centres is negatively correlated with the nitrogen oxide levels, so areas closer to the centres have more problems in this regard. Also, curiously, the tax rate seems to be positively correlated with crime in the area.
library(dplyr)
colnames(data)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
boston_scaled <- data %>% mutate_at(c("crim", "zn" , "indus" , "nox" , "rm" , "age" , "dis" , "rad" , "tax" , "ptratio" ,"black" , "lstat", "medv"), ~(scale(.) %>% as.vector))
str(boston_scaled)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num -0.419 -0.417 -0.417 -0.416 -0.412 ...
## $ zn : num 0.285 -0.487 -0.487 -0.487 -0.487 ...
## $ indus : num -1.287 -0.593 -0.593 -1.306 -1.306 ...
## $ chas : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ nox : num -0.144 -0.74 -0.74 -0.834 -0.834 ...
## $ rm : num 0.413 0.194 1.281 1.015 1.227 ...
## $ age : num -0.12 0.367 -0.266 -0.809 -0.511 ...
## $ dis : num 0.14 0.557 0.557 1.077 1.077 ...
## $ rad : num -0.982 -0.867 -0.867 -0.752 -0.752 ...
## $ tax : num -0.666 -0.986 -0.986 -1.105 -1.105 ...
## $ ptratio: num -1.458 -0.303 -0.303 0.113 0.113 ...
## $ black : num 0.441 0.441 0.396 0.416 0.441 ...
## $ lstat : num -1.074 -0.492 -1.208 -1.36 -1.025 ...
## $ medv : num 0.16 -0.101 1.323 1.182 1.486 ...
Here we did not scale the variable chas since it is binary. Scaling the already-proportional variables also seems odd, but I did it anyway since the instructions asked for it. Now let's create the factor variable.
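As a sanity check, scale() with the default arguments simply centres each column and divides by its standard deviation, which can be verified by hand on a single column:

```r
library(MASS)

# manual standardization of one column, compared against scale()
x <- Boston$crim
manual <- (x - mean(x)) / sd(x)
all.equal(as.vector(scale(x)), manual)  # TRUE
```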
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE)
# look at the distribution of the new factor crime
table(crime)
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
str(boston_scaled$crime)
## Factor w/ 4 levels "[-0.419,-0.411]",..: 1 1 1 1 1 1 2 2 2 2 ...
## 80% of the sample size
smp_size <- floor(0.8 * nrow(boston_scaled))
## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(boston_scaled)), size = smp_size)
train <- boston_scaled[train_ind, ]
test <- boston_scaled[-train_ind, ]
print(nrow(train))
## [1] 404
print(nrow(test))
## [1] 102
Now we have partitioned the data into training and test sets. Next, let's fit a linear discriminant analysis model.
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739] (0.00739,9.92]
## 0.2549505 0.2500000 0.2500000 0.2450495
##
## Group means:
## zn indus chas1 nox rm
## [-0.419,-0.411] 1.01866313 -0.9066422 0.04854369 -0.8835724 0.38666334
## (-0.411,-0.39] -0.07141687 -0.3429155 0.07920792 -0.5768545 -0.09882533
## (-0.39,0.00739] -0.40148591 0.2162741 0.11881188 0.3637157 0.12480815
## (0.00739,9.92] -0.48724019 1.0149946 0.06060606 1.0481437 -0.47733231
## age dis rad tax ptratio
## [-0.419,-0.411] -0.9213895 0.9094324 -0.6819317 -0.7458486 -0.4234280
## (-0.411,-0.39] -0.3254849 0.3694282 -0.5395408 -0.4205644 -0.1079710
## (-0.39,0.00739] 0.4564260 -0.3720478 -0.4349280 -0.3191090 -0.2716959
## (0.00739,9.92] 0.8274496 -0.8601246 1.6596029 1.5294129 0.8057784
## black lstat medv
## [-0.419,-0.411] 0.3729083 -0.766883775 0.47009410
## (-0.411,-0.39] 0.3103406 -0.164921798 0.01548761
## (-0.39,0.00739] 0.1049654 0.009720124 0.19440747
## (0.00739,9.92] -0.6383964 0.900379309 -0.71294711
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.148390645 0.74870532 -1.0874785
## indus 0.040971465 -0.38126652 -0.1619456
## chas1 0.009688321 0.15606070 0.6690362
## nox 0.312458150 -0.67564471 -1.3104018
## rm 0.011086947 -0.16058718 -0.1572603
## age 0.283482486 -0.38817624 -0.1013491
## dis -0.079848789 -0.38493775 0.2108038
## rad 3.718978412 0.93123532 -0.4706522
## tax -0.015197127 0.04230505 1.2889075
## ptratio 0.180294774 -0.01592588 -0.3558490
## black -0.136724112 0.02948075 0.1288959
## lstat 0.145739238 -0.37823065 0.3345688
## medv 0.061327205 -0.53906503 -0.1509890
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9523 0.0364 0.0113
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale = 1)
The biplot shows the observations projected onto the first two linear discriminants. Judging from the coefficients, rad (accessibility to radial highways) clearly dominates LD1, which alone accounts for about 95% of the between-group variance (the proportion of trace above). Next we will predict with the LDA model on the test set.
true_crime = test$crime
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = true_crime, predicted = lda.pred$class)
## predicted
## correct [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739] (0.00739,9.92]
## [-0.419,-0.411] 10 10 4 0
## (-0.411,-0.39] 2 17 6 0
## (-0.39,0.00739] 1 9 13 2
## (0.00739,9.92] 0 0 1 27
The different quantiles correspond to different crime rates, with [-0.419,-0.411] being the lowest bin and (0.00739,9.92] the highest. We predicted the high-crime areas very well: only 1 observation was misclassified when the true class was the highest crime rate. In the lower crime-rate classes there are more mistakes, which indicates that they are harder to separate. Now let's reload the Boston dataset and re-scale it in order to run k-means.
boston_scaled <- data %>% mutate_at(c("crim", "zn" , "indus" , "nox" , "rm" , "age" , "dis" , "rad" , "tax" , "ptratio" ,"black" , "lstat", "medv"), ~(scale(.) %>% as.vector))
# k-means clustering
km <-kmeans(boston_scaled, centers = 1)
# plot the Boston dataset with clusters
pairs(boston_scaled[5:10], col = km$cluster)
# k-means clustering
km <-kmeans(boston_scaled, centers = 2)
# plot the Boston dataset with clusters
pairs(boston_scaled[5:10], col = km$cluster)
# k-means clustering
km <-kmeans(boston_scaled, centers = 3)
# plot the Boston dataset with clusters
pairs(boston_scaled[5:10], col = km$cluster)
Here we tried three different numbers of clusters. Clearly one cluster is not useful, but two seems to partition the data nicely in most panels. Three tends to split a fairly uniform group into two separate groups, which does not seem useful. So here I would argue 2 is a good choice.
Regarding the clustering itself, we can see that there are, for example, two clear clusters in the nox/tax panel, which indicates that there is a set of outlier tax rates that are more or less the same.
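The visual impression about the number of clusters can be cross-checked with the elbow heuristic: run k-means for a range of k and look for the point where the total within-cluster sum of squares (WCSS) stops dropping sharply. A sketch on the scaled Boston data (here chas is kept as a scaled 0/1 numeric, unlike above):

```r
library(MASS)

# scale all columns of Boston so kmeans gets a numeric matrix
boston_num <- as.data.frame(scale(Boston))

set.seed(123)
# total within-cluster sum of squares for k = 1..10
twcss <- sapply(1:10, function(k) kmeans(boston_num, centers = k)$tot.withinss)

# the "elbow" in this curve suggests a reasonable number of clusters
plot(1:10, twcss, type = "b", xlab = "number of clusters k", ylab = "total WCSS")
```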
date()
## [1] "Fri Dec 3 15:24:45 2021"
library(GGally)
library(ggplot2)
library('FactoMineR')
library('tidyr')
library(dplyr)
human = read.csv('http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt', sep=',')
dim(human)
## [1] 155 8
ggpairs(human, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))
There is a major correlation between life expectancy and adolescent birth rate, as well as between maternal mortality and life expectancy: higher life expectancy goes together with lower maternal mortality, which is expected. Also, the gross national income correlates negatively with maternal mortality and positively with life expectancy. Judging from the distribution of the values, there are a few countries with very high maternal mortality while the rest are quite low.
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
pca_human <- prcomp(human)
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
human_std <- scale(human)
# print out summaries of the standardized variables
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_std)
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2)
The results are very different because the scales of the variables were previously so different (GNI in particular) that they skewed the PCA results. In this plot we see that the first principal component takes high values in countries with high maternal mortality and adolescent birth rates, and low values in countries with high expected education and life expectancy. This means that, for example, Niger has low life expectancy but high maternal mortality. The second principal component, on the other hand, correlates with a high female-to-male labour participation ratio and with the share of women in parliament, which means that, for example, Iran has few women working while Bolivia has many.
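The share of variance captured by each principal component can be read from the importance table of summary(). A sketch using the built-in iris data (the same call works on pca_human):

```r
# PCA on the scaled numeric columns of iris
pca <- prcomp(scale(iris[, 1:4]))
s <- summary(pca)

# percentage of variance explained by each principal component
round(100 * s$importance["Proportion of Variance", ], 1)
```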
library('FactoMineR')
library('tidyr')
data(tea)
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
# select the 'keep_columns' to create a new dataset
tea_time <- dplyr::select(tea, keep_columns)
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(keep_columns)` instead of `keep_columns` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
# visualize the dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.279 0.261 0.219 0.189 0.177 0.156 0.144
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519 7.841
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953 77.794
## Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.141 0.117 0.087 0.062
## % of var. 7.705 6.392 4.724 3.385
## Cumulative % of var. 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139 0.003
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626 0.027
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111 0.107
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841 0.127
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979 0.035
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990 0.020
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347 0.102
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459 0.161
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968 0.478
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898 0.141
## v.test Dim.3 ctr cos2 v.test
## black 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 2.867 | 0.433 9.160 0.338 10.053 |
## green -5.669 | -0.108 0.098 0.001 -0.659 |
## alone -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 3.226 | 1.329 14.771 0.218 8.081 |
## milk 2.422 | 0.013 0.003 0.000 0.116 |
## other 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
# visualize MCA
plot(mca, invisible=c("ind"))
It seems that the categories ‘other’ and ‘chain store+tea shop’ are most correlated with dimension 2, while ‘tea shop’ and ‘unpackaged’ are most correlated with dimension 1.
(more chapters to be added similarly as we proceed with the course!)